RAPToR - ShowcaseThis vignette showcases different uses for RAPToR.
Including estimated age as a covariate in Differential Expression (DE) analysis can substantially reduce previously unexplained variation between samples. In this example, we show that even when chronological age of different time points is known, using estimated age results in better model fits and consequently better power to detect DE genes.
Lehrbach et al. (2012) profiled C. elegans wild-type (wt) and pash-1(mj100) mutants (mut) at 4 time points (0, 6, 12, and 24 hours past the L4 stage), each in triplicate (Accession: E-MTAB-1333, dslehrbach2012).
Given the time-series experimental design, searching for Differentially Expresses (DE) genes between strains requires including development in the model. While true in any time series context, it is particularly important here as late-larval development of C. elegans is known for drastic changes in gene expression within short time frames (Snoek et al. (2014)).
Using identical DE models with either chronological age or RAPToR age estimates as covariates, we can determine which of the two is the best predictor and thus gives more power to detect differential expression.
Code to generate the dslehrbach2012 object can be found at the end of this section.
library(RAPToR)
library(wormRef)
library(stats)
library(parallel)
library(limma)
We start by applying a quantile-normalization and \(log(X+1)\) transformation to the data.
dslehrbach2012$g <- limma::normalizeBetweenArrays(dslehrbach2012$g,
method = "quantile")
dslehrbach2012$g <- log1p(dslehrbach2012$g)
Next, we select a reference and stage the samples. Since sample (chronological) age ranges from the fourth larval molt (L4) to 24h past L4, we select Cel_YA_2 from wormRef that covers this range.
# load reference
r_ya <- prepare_refdata("Cel_YA_2", "wormRef", 400)
# estimate sample age
ae_lerhbach <- ae(dslehrbach2012$g, r_ya)
#> nb.genes
#> refdata 15710
#> samp 17441
#> intersect.genes 14952
#> Bootstrap set size is 4984
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
We can see quite a lot of variation in the sample timings, especially at the 0- and 6-hour time points.
Another curious effect can be noted, easier to see when grouping the samples by batch. In the first two replicates, mutants are systematically (and statistically significantly) older than WT, while the opposite effect is true for the third replicate.
# compute age difference between wt & mut at each time point
isWT <- dslehrbach2012$p$strain=="wt"
ae_dif <- ae_lerhbach$age.estimates[isWT,1] - ae_lerhbach$age.estimates[!isWT,1]
# test for effect significance with a linear model
batch <- dslehrbach2012$p$rep[isWT]
summary(lm(ae_dif~batch))
#>
#> Call:
#> lm(formula = ae_dif ~ batch)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4889 -0.3440 0.0491 0.7862 2.4079
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.1130 0.7867 -2.686 0.0250 *
#> batch2 -1.0811 1.1126 -0.972 0.3566
#> batch3 3.3169 1.1126 2.981 0.0154 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.573 on 9 degrees of freedom
#> Multiple R-squared: 0.6535, Adjusted R-squared: 0.5765
#> F-statistic: 8.486 on 2 and 9 DF, p-value: 0.008487
Gene expression depends so strongly on development that correlation between samples is only predicted by their age difference, rather than by their genetic background.
The graph below shows correlation between all possible pairs of samples, with no clear effect of strain difference.
cc <- cor(dslehrbach2012$g, dslehrbach2012$g, method = "spearman")
We will use limma to perform a DE analysis with the following model:
\[ X \sim strain \times \mathtt{ns(age, df = 2)} \] where \(\mathtt{age}\) will either be chronological or estimated age, and the spline \(\mathtt{ns()}\) will be used to handle non-linear expression dynamics along time.
To find differential expression of genes along development or between strains translates to comparing the following nested models: (2) vs. (1) for development, (3) vs. (1) for strain.
\[ \begin{aligned} Y & = \beta_0 + \beta_1 I_{strain} + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) + (\gamma_1 I_{strain} age_{sp1} + \gamma_2 I_{strain} age_{sp2})\\ Y & = \beta_0 + \beta_1 I_{strain}\\ Y & = \beta_0 + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) \end{aligned} \] Genes will be considered DE with Benjamini-Holm adjusted p-values \(< 0.05\) for the corresponding coefficients.
Note: for the purpose of this vignette, we don’t consider whether genes are up- or down-regulated, but only if a significant effect is detected.
We now run the same DE analysis using either chronological age or RAPToR age estimates as predictor, and compare results to determine if there is an improvement. Code for DGE() can be found at the end of this section.
# format gdata as log2 for limma input
X <- log2(exp(dslehrbach2012$g))
# with chron. age
dge.ca <- DGE(X = X, strain = dslehrbach2012$p$strain,
age = dslehrbach2012$p$tpastL4,
name = "dge.ca", return.model = T)
#> Loading required package: splines
# with RAPToR ae
dge.ae <- DGE(X = X, strain = dslehrbach2012$p$strain,
age = ae_lerhbach$age.estimates[,1],
name = "dge.ae", return.model = T)
Estimated age shows a clear increase in performance over chronological age when comparing adjusted p-values of each gene for detection of an effect. Red bars in the plots above correspond to the \(0.05\) threshold for significance and color-coded gene counts for each quadrant are indicated in the bottom-right.
# compute nb. of DE genes for mutation
mut_dif.ca <- sum(dge.ca$tmut$adj.P.Val < 0.05)
mut_dif.ae <- sum(dge.ae$tmut$adj.P.Val < 0.05)
# compute nb. of DE genes for development
age_dif.ca <- sum(dge.ca$tage$adj.P.Val < 0.05)
age_dif.ae <- sum(dge.ae$tage$adj.P.Val < 0.05)
# percentage increase of mutation DE genes
100 * (mut_dif.ae - mut_dif.ca) / (mut_dif.ca) # mut pct increase
#> [1] 96.02237
# percentage increase of development DE genes
100 * (age_dif.ae - age_dif.ca) / (age_dif.ca) # age pct increase
#> [1] 7.674787
Indeed, we detect nearly twice as many DE genes for strain using RAPToR age estimates compared to chronological age, and around \(8\%\) more genes with development. Furthermore, the large proportion of genes changing with development (\(57 - 62\%\)) compared to strain (\(7-16\%\)) also shows how crucial it is to properly model developmental dynamics in DE analyses.
The increase in detected DE genes is partially explained by better model fits, shown below by the goodness-of-fit (GoF) distribution across genes. The goodness-of-fit (GoF) computed is an \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\) per gene.
If the asynchronicity that RAPToR detects between samples is erroneous, then adding random noise around the chronological age values should yield similar results to using our age estimates in the model. To test this, we simulate age sets with added noise of similar distribution to the age differences observed between chronological and estimated age.
set.seed(10) # for reproducibility
age_diffs <- (dslehrbach2012$p$tpastL4 + 50) - ae_lerhbach$age.estimates[,1]
# Note : we add 50 to shift tpastL4 to the age values and avoid negative
# tpastl4 values. This has no impact on the DE analysis.
# estimate density function of age_diffs
d_ad <- density(age_diffs)
# generate 100 age sets of with random age_diffs-like noise
n <- 100
rd_ages <- lapply(seq_len(n), function(i){
(dslehrbach2012$p$tpastL4 + 50) +
sample(x = d_ad$x, size = nrow(dslehrbach2012$p),
prob = d_ad$y, replace = T)
})
As done above, we run identical DE models and compare results with the model goodness-of-fit (GoF) per gene and look at the number of DE genes found for strain and development (BH-adjusted p-value \(< 0.05\)).
# setup cluster for parallelization
cl <- parallel::makeCluster(6, "FORK")
# do DGE on all age sets
rd_dges <- parLapply(cl, seq_len(n), function(i){
cat("\r", i,"/",n)
DGE(X, dslehrbach2012$p$strain, rd_ages[[i]], name = paste0("rd.",i))
})
stopCluster(cl)
gc()
# get quantiles of GoF for plotting
qts <- seq(0,1, length.out = 100)
quants <- lapply(seq_len(n), function(i){
quantile(rd_dges[[i]]$gof, probs = qts)
})
quants <- do.call(rbind, quants)
We can see that using estimated age systematically leads to better model fits and increased detection of DE genes both for strain and development.
Gene GoF using estimated age and binned by chronological age GoF quantiles (below, left) also clearly shows that for the same genes, we tend to have better fits with estimated age, while that is not the case when adding noise.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("biomaRt", quietly = T) # bioconductor
requireNamespace("limma", quietly = T) # bioconductor
requireNamespace("affy", quietly = T) # bioconductor
requireNamespace("gcrma", quietly = T) # bioconductor
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
To generate dslehrbach2012:
# get probe ids from the arrayexpress
probe_ids <- read.table(
"https://www.ebi.ac.uk/arrayexpress/files/A-AFFY-60/A-AFFY-60.adf.txt",
h=T, comment.char = "", sep = "\t", skip = 17, as.is = T)
# pheno data
p_url <- paste0("https://www.ebi.ac.uk/arrayexpress/files/",
"E-MTAB-1333/E-MTAB-1333.sdrf.txt")
p <- read.table(p_url, h = T, sep = "\t", comment.char = "", as.is = T)
# geno data
g_url <- unique(p$Comment..ArrayExpress.FTP.file.)
g_dir <- paste0(data_folder, "dslehrbach2012_raw/")
g_file <- paste0(g_dir, "raw.zip")
if(!file.exists(g_file)){
dir.create(g_dir)
utils::download.file(g_url, destfile = g_file)
}
utils::unzip(g_file, exdir = g_dir)
g <- affy::ReadAffy(filenames = p$Array.Data.File,
celfile.path = g_dir, phenoData = p)
g <- affy::expresso(g, bg.correct = F, normalize = F,
pmcorrect.method = "pmonly", summary.method = "median")
g <- 2^exprs(g) # expresso log2s the data
colnames(g) <- p$Source.Name
# format ids
g <- RAPToR::format_ids(g, probe_ids, from = 1, to = 7)
g <- RAPToR::format_ids(g, wormRef::Cel_genes, from = 3, to = 1)
# filter relevant fields
p <- p[, c(1,22,24)]
colnames(p) <- c("title", "tpastL4", "strain")
p$strain <- factor(p$strain,
levels = c('wild type', 'pash-1(mj100)'),
labels = c('wt', 'strain'))
p$rep <- factor(gsub("\\w+_h\\d+\\.(\\d)","\\1", p$title))
dslehrbach2012 <- list(g = g, p = p)
save(dslehrbach2012,
file = paste0(data_folder, "dslerhbach2012.RData"), compress = "xz")
# cleanup
file.remove(g_file)
unlink(g_dir, recursive = T)
rm(g, g_url, g_dir, g_file, p, p_url, probe_ids)
Functions to compute predictions (pred_lmFit()) and Goodness-of-fit (GoF()) from limma models.
pred_lmFit <- function(Fit){
# Predictions from a limma model
tcrossprod(Fit$coefficients, Fit$design)
}
GoF <- function(Fit, X){
# Compute Goodness of Fit
pred <- pred_lmFit(Fit)
res <- (X - pred)
ss <- apply(X, 1, function(ro) sum((ro - mean(ro))^2))
Rsq <- sapply(seq_len(nrow(X)), function(i){
1 - sum(res[i,]^2)/ss[i]
})
return(Rsq)
}
Function to run DE analysis with limma, as described in Differential Expression analysis (DGE()).
DGE <- function(X, strain, age, df = 2, name = NULL, return.model = FALSE){
require(splines)
if(! length(strain) == ncol(X) | ! length(age) == ncol(X))
stop("strain and age must be of length ncol(X).")
# make pdat df
pdat <- data.frame(strain = factor(strain),
age = as.numeric(age),
row.names = colnames(X))
# build design matrix
d <- model.matrix(~ 1 + ns(age, df = df) * strain, data = pdat)
# fix colnames
colnames(d) <- c("b0", paste0(rep("a", df), 1L:df),
"strainmut", paste0(rep("strainmut.a", df), 1L:df))
# build contrast matrices for mut and age tests
if(df == 2){
cm.mut <- makeContrasts(mut = strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
levels = d)
cm.age <- makeContrasts(a1, a2,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
levels = d)
}
if(df == 3){
cm.mut <- makeContrasts(mut = strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
mut.i3 = strainmut.a3,
levels = d)
cm.age <- makeContrasts(a1, a2, a3,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
a.i3 = strainmut.a3,
levels = d)
}
# fit model
m.0 <- lmFit(object = X, design = d)
# get GoF
gof <- GoF(Fit = m.0, X = X)
# find DE genes for mut
m.m <- contrasts.fit(m.0, contrasts = cm.mut)
m.m <- eBayes(m.m)
# find DE genes for age
m.a <- contrasts.fit(m.0, contrasts = cm.age)
m.a <- eBayes(m.a)
Tmut <- topTable(m.m, adjust.method = "BH",
number = Inf,
sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
Tage <- topTable(m.a, adjust.method = "BH",
number = Inf,
sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
res <- list(gof = gof,
tmut = Tmut,
tage = Tage,
name = name)
if(return.model)
res$model = m.0
rm(m.0, m.m, m.a, Tmut, Tage, d, X, pdat, gof)
gc(verbose = F)
return(res)
}
Your organism of interest may not be well-studied or have an abundance of reference time-series data available. However, RAPToR still works when using a close organism as a reference, thanks to the conserved nature of developmental processes across species (particularly in early development).
Indeed, samples can be staged cross-species using ortholog genes.
Li et al. (2014) defined a set of orthologs between D. melanogaster and C. elegans (hereafter, glist), which we will use to stage C. elegans single embryos on a D. melanogaster reference (of note, ensembl ortholog sets also work).
The C. elegans time-series of single-embryos was profiled and published by Levin et al. (2016) (Accession : GSE60755, dslevin2016cel).
The Drosophila melanogaster embryonic development time-series is part of the modENCODE project and published by Graveley et al. (2011) (data downloaded from fruitfly.org, dsgraveley2011)
Code to generate glist (orthologs), dslevin2016cel (C. elegans data), and dsgraveley2011 (D. melanogaster data) can be found at the end of this section.
library(RAPToR)
library(drosoRef)
library(limma)
library(stats)
We start by applying a quantile-normalization and \(log(X+1)\) transformation to the data.
dsgraveley2011$g <- limma::normalizeBetweenArrays(dsgraveley2011$g,
method = "quantile")
dsgraveley2011$g <- log1p(dsgraveley2011$g)
dslevin2016cel$g <- limma::normalizeBetweenArrays(dslevin2016cel$g,
method = "quantile")
dslevin2016cel$g <- log1p(dslevin2016cel$g)
4 outliers in the C. elegans data are then removed from the analysis (see About the outliers).
# outlier samples (see below)
rem <- c(sample_0001 = 53L, sample_0002 = 54L,
sample_0003 = 58L, sample_0004 = 59L)
To stage samples we must build a reference from the Drosophila data, which happens to be the Dme_embryo reference of the drosoRef package.
It can thus be directly loaded with prepare_refdata().
# reference built from dsgraveley2011
r_grav <- prepare_refdata("Dme_embryo", "drosoRef", 500)
We then convert the C. elegans gene IDs to their D. melanogaster orthologs, and stage the C. elegans embryos with the Drosophila reference. In this case, many-to-one orthologs are averaged and one-to-many are assigned to the first ID match.
dslevin2016cel$g_dmel <- format_ids(dslevin2016cel$g, glist,
from = "wb_id", to = "fb_id")
#> Kept 5714 out of 20168 - aggregated into 4214
ae_cel_on_dmel <- ae(dslevin2016cel$g_dmel, r_grav)
#> nb.genes
#> refdata 12408
#> samp 4214
#> intersect.genes 3194
#> Bootstrap set size is 1065
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
The gaps we see in the staging results are likely at timings where there are incompatible expression dynamics between the two species.
By re-building the Drosophila reference on the first 2 components which are broad or monotonic, we can keep only these expression dynamics of development in the reference. This can improve staging because it filters out the incompatible dynamics.
# build same model, but restrict interpolation to 2 components instead of 8
m_grav2 <- ge_im(X = dsgraveley2011$g, p = dsgraveley2011$p,
formula = "X ~ s(age, bs = 'cr')", nc = 2)
# make adjusted reference object
r_grav2 <- make_ref(m_grav2, n.inter = 500,
t.unit = "h past egg-laying",
metadata = list("organism"="D. melanogaster"))
We then stage the C. elegans embryos on this adjusted reference.
ae_cel_on_dmel2 <- ae(dslevin2016cel$g_dmel, r_grav2)
#> nb.genes
#> refdata 12300
#> samp 4214
#> intersect.genes 3143
#> Bootstrap set size is 1048
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
We removed 4 outlier samples from the analyses above because they have an erroneous noted age. This is evidenced by their position on principal components, where they appear to be around 2 hours “older” than their specified age.
pca_cel <- stats::prcomp(t(dslevin2016cel$g), rank = 10,
center = TRUE, scale = FALSE)
This is even further confirmed by their estimated age on the Drosophila reference, which places them as expected.
Required packages and variables:
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # bioconductor
requireNamespace("Biobase", quietly = T) # bioconductor
requireNamespace("biomaRt", quietly = T) # bioconductor
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
raw2tpm <- function(rawcounts, genelengths){
if(nrow(rawcounts) != length(genelengths))
stop("genelengths must match nrow(rawcounts).")
x <- rawcounts/genelengths
return(t( t(x) * 1e6 / colSums(x) ))
}
fpkm2tpm <- function(fpkm){
return(exp(log(fpkm) - log(colSums(fpkm)) + log(1e6)))
}
To build glist, ortholog genes between C. elegans and D. melanogaster from Li et al. (2014) Supplementary Table 1:
tmp_file <- paste0(data_folder, "dmel_cel_orth.zip")
tmp_fold <- paste0(data_folder, "dmel_cel_orth/")
f_url <- paste0("https://genome.cshlp.org/content/suppl/2014/05/15/",
"gr.170100.113.DC1/Supplemental_Files.zip")
utils::download.file(url = f_url, destfile = tmp_file)
utils::unzip(tmp_file, exdir = tmp_fold)
glist <- read.table(
paste0(tmp_fold,
"Supplementary\ files/TableS1\ fly-worm\ ortholog\ pairs.txt"),
skip = 1, h=T, sep = "\t", as.is = T, quote = "\""
)
colnames(glist) <- c("fb_id", "dmel_name", "cel_id", "cel_name")
glist$wb_id <- wormRef::Cel_genes[
match(glist$cel_id, wormRef::Cel_genes$sequence_name), "wb_id"]
save(glist, file = paste0(data_folder, "sc2_glist.RData"), compress = "xz")
# cleanup
file.remove(tmp_file)
unlink(tmp_fold, recursive = T)
rm(tmp_file, tmp_fold, f_url)
To build dsgraveley2011, first get drosophila genes from ensembl:
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"transcript_end",
"transcript_start"),
mart = mart)
droso_genes$transcript_length <-
droso_genes$transcript_end - droso_genes$transcript_start
droso_genes <- droso_genes[,c(1:3,6)]
colnames(droso_genes)[1:3] <- c("fb_id", "transcript_id", "gene_name")
rm(mart)
Then, download D. melanogaster data from Graveley et al. (2011) :
g_url_dsgraveley2011 <- paste0(
"ftp://ftp.fruitfly.org/pub/download/modencode_expression_scores/",
"Celniker_Drosophila_Annotation_20120616_1428_allsamps_",
"MEAN_gene_expression.csv.gz")
g_file_dsgraveley2011 <- paste0(data_folder, "dsgraveley2011.csv.gz")
utils::download.file(g_url_dsgraveley2011, destfile = g_file_dsgraveley2011)
X_dsgraveley2011 <- read.table(gzfile(g_file_dsgraveley2011),
sep = ',', row.names = 1, h = T)
# convert gene ids to FBgn
X_dsgraveley2011 <- RAPToR::format_ids(X_dsgraveley2011, droso_genes,
from = "gene_name", to = "fb_id")
# select embryo time series samples
X_dsgraveley2011 <- X_dsgraveley2011[,1:12]
P_dsgraveley2011 <- data.frame(
sname = colnames(X_dsgraveley2011),
age = as.numeric(gsub("em(\\d+)\\.\\d+hr", "\\1",
colnames(X_dsgraveley2011))),
stringsAsFactors = FALSE)
dsgraveley2011 <- list(g = X_dsgraveley2011, p = P_dsgraveley2011)
save(dsgraveley2011,
file = paste0(data_folder, "dsgraveley2011.RData"), compress = "xz")
# cleanup
file.remove(g_file_dsgraveley2011)
rm(g_url_dsgraveley2011, g_file_dsgraveley2011,
X_dsgraveley2011, P_dsgraveley2011)
To build dslevin2016cel, C. elegans single-embryo data from Levin et al. (2016)
geo_dslevin2016cel <- "GSE60755"
g_url_dslevin2016cel <- GEOquery::getGEOSuppFiles(geo_dslevin2016cel,
makeDirectory = FALSE,
fetch_files = FALSE)
g_file_dslevin2016cel <- paste0(data_folder, "dslevin2016cel.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016cel$url[1]),
destfile = g_file_dslevin2016cel)
X_dslevin2016cel <- read.table(gzfile(g_file_dslevin2016cel),
h = T, sep = '\t', as.is = T, row.names = 1,
comment.char = "")
# filter poor quality samples
cm_dslevin2016cel <- RAPToR::cor.gene_expr(X_dslevin2016cel, X_dslevin2016cel)
f_dslevin2016cel <- which(0.67 > apply(cm_dslevin2016cel, 1,
quantile, probs = .99))
X_dslevin2016cel <- X_dslevin2016cel[, -f_dslevin2016cel]
# convert to tpm & FBgn
X_dslevin2016cel <- X_dslevin2016cel[
rownames(X_dslevin2016cel)%in%wormRef::Cel_genes$sequence_name,]
X_dslevin2016cel <- raw2tpm(
rawcounts = X_dslevin2016cel,
genelengths = wormRef::Cel_genes$transcript_length[
match(rownames(X_dslevin2016cel), wormRef::Cel_genes$sequence_name)]
)
X_dslevin2016cel <- RAPToR::format_ids(X_dslevin2016cel, wormRef::Cel_genes,
from = "sequence_name", to = "wb_id")
# pheno data
P_dslevin2016cel <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016cel,
getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016cel <- P_dslevin2016cel[
, c("title", "geo_accession", "time point (minutes after 4-cell):ch1")]
colnames(P_dslevin2016cel)[3] <- "time"
P_dslevin2016cel$title <- as.character(P_dslevin2016cel$title)
P_dslevin2016cel <- P_dslevin2016cel[
P_dslevin2016cel$title %in% colnames(X_dslevin2016cel),]
# formatting
P_dslevin2016cel$age <- as.numeric(P_dslevin2016cel$time) / 60
P_dslevin2016cel <- P_dslevin2016cel[order(P_dslevin2016cel$age),]
X_dslevin2016cel <- X_dslevin2016cel[, P_dslevin2016cel$title]
P_dslevin2016cel$title <- gsub('Metazome_CE_timecourse_', '',
P_dslevin2016cel$title)
colnames(X_dslevin2016cel) <- P_dslevin2016cel$title
# remove extra outlier
X_dslevin2016cel <- X_dslevin2016cel[,-127]
P_dslevin2016cel <- P_dslevin2016cel[-127,]
dslevin2016cel <- list(g = X_dslevin2016cel, p = P_dslevin2016cel)
save(dslevin2016cel,
file = paste0(data_folder, "dslevin2016cel.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016cel)
rm(geo_dslevin2016cel, g_url_dslevin2016cel, g_file_dslevin2016cel,
f_dslevin2016cel, cm_dslevin2016cel, X_dslevin2016cel, P_dslevin2016cel)
In some species, tissues can develop at different rates, and these rates can vary between individuals. In C. elegans, Perez et al. (2017) have shown such developmental heterochrony between soma and germline tissues.
By restricting the genes RAPToR uses for staging to those associated with (or expressed in) specific tissues, it is possible to estimate development specific to these tissues.
Rockman, Skrovanek, and Kruglyak (2010) published microarray profiling of 208 recombinant inbred lines of C. elegans N2 and Hawaii (CB4856) strains (Accession: GSE23857, dsrockman2010).
These 208 samples were described as “developmentally synchronized” in the original article.
However, Francesconi and Lehner (2014) later showed there is significant developmental spread between samples, spanning around 20 hours of 20°C late-larval development, essentially making this dataset a very high-resolution time course.
We will start by staging samples with all available genes to get their “global age”.
Then, we stage samples a second time, but restricting the input to a germline set of 2554 genes by combining the germline_intrinsic, germline_oogenesis_enriched, and germline_sperm_enriched categories defined in Perez et al. (2017) (based on previous work by Reinke et al. (2004)). This will give us a germline age for the samples.
Staging the samples a third time, with a soma set of 2718 genes from the osc (molting) gene category defined in Hendriks et al. (2014), will give us the soma age.
Finally, we evaluate the results of staging using principal (PCA) or independent (ICA) components, that summarize the expression dynamics of the data. For example, we can expect that components corresponding to particular tissues will have noticeably cleaner dynamics along their respective age estimates (e.g. oscillatory molting dynamics would be less noisy with soma age than germline age).
Code to generate dsrockman2010 (RIL expression data) and gsubset (germline/soma gene sets) can be found at the end of this section
library(RAPToR)
library(wormRef)
library(ica)
library(stats)
library(limma)
# for plotting
library(vioplot)
We start by applying a quantile-normalization and \(log(X+1)\) transformation to the data.
dsrockman2010$g <- limma::normalizeBetweenArrays(dsrockman2010$g,
method = "quantile")
dsrockman2010$g <- log1p(dsrockman2010$g)
Then, we select an appropriate C. elegans reference (larval to young-adult), and stage the samples.
r_ya <- prepare_refdata("Cel_YA_1", "wormRef", n.inter = 400)
ae_dsrockman2010 <- ae(dsrockman2010$g, r_ya)
#> nb.genes
#> refdata 16327
#> samp 14440
#> intersect.genes 13288
#> Bootstrap set size is 4429
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
Our estimates confirm the wide developmental spread and match the previously inferred ages by Francesconi and Lehner (2014).
Expression dynamics in the data can be observed through ICA components, and enrichment in component gene loadings (or contributions) allows us to associate them to specific processes.
We find out how many independent components (IC) to extract from how many principal components are needed to explain \(90\%\) of the variance in the data.
pca_rock <- summary(stats::prcomp(t(dsrockman2010$g),
rank = 30, center = TRUE, scale = FALSE))
nc <- sum(pca_rock$importance["Cumulative Proportion",] < .90) + 1
nc
#> [1] 48
Then, we perform an ICA extracting the 48 components.
ica_rock <- ica::icafast(t(scale(t(dsrockman2010$g), center = T, scale = F)),
nc = nc)
We can plot the first few independent components along global estimated age of the samples :
It seems that past IC7, components don’t have a definite link to developmental processes, which is expected given the (intended) genetic background heterogeneity in the samples.
We can have a closer look at components clearly capturing development, and the enrichment of their loadings for the gene sets of interest.
dev_comps <- 1:7
From the gene loadings above, we can establish that
Now, we stage the samples using only germline or soma gene subsets.
ae_soma <- ae(
dsrockman2010$g[rownames(dsrockman2010$g) %in% gsubset$soma,],
r_ya)
#> nb.genes
#> refdata 16327
#> samp 2005
#> intersect.genes 1815
#> Bootstrap set size is 605
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
ae_germline <- ae(
dsrockman2010$g[rownames(dsrockman2010$g) %in% gsubset$germline,],
r_ya)
#> nb.genes
#> refdata 16327
#> samp 1856
#> intersect.genes 1666
#> Bootstrap set size is 555
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
We notice the soma estimates of two samples (marked with \(\small{\triangle}\)) are quite off from their global or germline age. This caused by the combination of both the fact we used a small gene set for inferring age, and that very similar expression profiles can occur at different times in oscillatory profiles (which is the case for molting genes).
If we look at the global, soma, or germline correlation profiles of one of these samples, we can see 2 peaks of similar correlation for the soma, which is not the case for germline or global correlation.
This is the perfect situation to use a prior for staging. We can input the global age as a prior so that the correct peak of the soma correlation profile is favored in the erroneous samples.
ae_soma_prior <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$soma,],
r_ya,
prior = ae_dsrockman2010$age.estimates[,1], # gaussian prior values (mean)
prior.params = 10 # gaussian prior sd
)
#> nb.genes
#> refdata 16327
#> samp 2005
#> intersect.genes 1815
#> Bootstrap set size is 605
#> Performing age estimation...
#> Bootstrapping...
#> Building gene subsets...
#> Computing correlations...
#> Performing age estimation...
#> Computing summary statistics...
Indeed, the estimate is now shifted to the first (correct) peak. Note that the correlation profile itself (central black line) is unchanged (dotted lines can be different as they are derived from bootstrap estimates with random gene sets).
At the same time, all the other estimates are essentially unchanged.
# 80 & 141 are the offset samples
mean((ae_soma$age.estimates[,1] - ae_soma_prior$age.estimates[,1])[-c(80,141)])
#> [1] -0.014664
Now, we can observe the expression dynamics in the data along tissue-specific estimates.
As expected, components IC1, IC4, IC5, and IC7 that we previously associated with soma are much cleaner along the soma age, but very noisy along germline age.
At the same time, germline-associated components IC2 and IC3 appear quite noisy along soma age, but are very clean along germline age estimates.
This effect is the result of soma-germline heterochrony between the samples.
Note: In this vignette, we use Cel_YA_1 to stage the samples, which is different from the RAPToR article (Bulteau and Francesconi (2022)) which uses Cel_larv_YA. The article shows a combination of soma-germline heterochrony between the reference and the samples, as well as among the samples. We shortened the analysis for the vignette to only show heterochrony among RILs (e.g. the ICA does not combine reference and samples, so components are different; however, enrichment of soma or germline genes to specific components/dynamics is still clear).
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("readxl", quietly = T)
requireNamespace("GEOquery", quietly = T) # bioconductor
requireNamespace("Biobase", quietly = T) # bioconductor
requireNamespace("limma", quietly = T) # bioconductor
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
To generate dsrockman2010:
geo_dsrockman2010 <- "GSE23857"
geo_dsrockman2010 <- GEOquery::getGEO(geo_dsrockman2010, GSEMatrix = F)
# get pdata
P_dsrockman2010 <- do.call(
rbind,
lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
unlist(GEOquery::Meta(go)[
c("geo_accession", "characteristics_ch1", "characteristics_ch2",
"label_ch1", "label_ch2")]
)
})
)
P_dsrockman2010 <- as.data.frame(P_dsrockman2010, stringsAsFactors = F)
# get RG data from microarray
RG_dsrockman2010 <- list(
R = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "R_MEAN_SIGNAL"]
})),
G = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "G_MEAN_SIGNAL"]
}))
)
# normalize within channels
RG_dsrockman2010 <- limma::normalizeWithinArrays(RG_dsrockman2010,
method = "loess")
RG_dsrockman2010 <- limma::RG.MA(RG_dsrockman2010) # convert back from MA to RG
# get only the RIALs (not the mixed stage controls)
X_dsrockman2010 <- RG_dsrockman2010$R
X_dsrockman2010[, P_dsrockman2010$label_ch1 == "Cy5"] <-
RG_dsrockman2010$G[, P_dsrockman2010$label_ch1 == "Cy5"]
# format probe/gene ids
gpl <- GEOquery::getGEO(GEOquery::Meta(
GEOquery::GSMList(geo_dsrockman2010)[[1]]
)$platform_id)
gpl <- GEOquery::Table(gpl)
gpl <- gpl[as.character(gpl$ID) %in% as.character(GEOquery::Table(
GEOquery::GSMList(geo_dsrockman2010)[[1]]
)$ID_REF), ]
sel <- gpl$ORF%in%wormRef::Cel_genes$sequence_name
gpl <- gpl[sel,]
X_dsrockman2010 <- X_dsrockman2010[sel,]
# filter bad quality samples
cm_dsrockman2010 <- cor(log1p(X_dsrockman2010), method = 'spearman')
f_dsrockman2010 <- which(0.95 > apply(cm_dsrockman2010, 1,
quantile, probs = .95))
X_dsrockman2010 <- X_dsrockman2010[,-f_dsrockman2010]
P_dsrockman2010 <- P_dsrockman2010[-f_dsrockman2010,]
# format ids
rownames(X_dsrockman2010) <- gpl$ORF
X_dsrockman2010 <- RAPToR::format_ids(X_dsrockman2010, wormRef::Cel_genes,
from = "sequence_name", to = "wb_id")
dsrockman2010 <- list(g = X_dsrockman2010, p = P_dsrockman2010)
save(dsrockman2010,
file = paste0(data_folder, "dsrockman2010.RData"), compress = "xz")
rm(geo_dsrockman2010, gpl, sel,
RG_dsrockman2010, X_dsrockman2010, P_dsrockman2010,
cm_dsrockman2010, f_dsrockman2010)
To download the soma and germline gene sets (gsubset) :
library(readxl)
# germline sets from Perez et al. (2017)
germline_url <- paste0("https://static-content.springer.com/esm/",
"art%3A10.1038%2Fnature25012/MediaObjects/",
"41586_2017_BFnature25012_MOESM3_ESM.xlsx")
germline_file <- paste0(data_folder, "germline_gset.xlsx")
utils::download.file(url = germline_url, destfile = germline_file)
germline_set <- read_xlsx(germline_file, sheet = 3, na = "NA")[,c(1, 44:46)]
germline_set[is.na(germline_set)] <- FALSE
germline_set <- cbind(wb_id = germline_set[,1],
germline = apply(germline_set[, 2:4], 1,
function(r) any(r)),
germline_set[, 2:4])
germline <- germline_set[germline_set$germline,1]
germline_intrinsic <- germline_set[germline_set$germline_intrinsic,1]
germline_oogenesis <- germline_set[germline_set$germline_oogenesis_enriched,1]
germline_sperm <- germline_set[germline_set$germline_sperm_enriched,1]
# soma set from Hendriks et al. (2014)
soma_url <- paste0("https://ars.els-cdn.com/content/image/",
"1-s2.0-S1097276513009039-mmc2.xlsx")
soma_file <- paste0(data_folder, "soma_gset.xlsx")
utils::download.file(url = soma_url, destfile = soma_file)
soma_set <- read_xlsx(soma_file, skip = 3, na = "NA")[,c(1, 4)]
soma_set$class <- factor(soma_set$class)
soma_set$soma <- soma_set$class == "osc"
soma_set <- soma_set[soma_set$soma, 1]
# save gene sets
gsubset <- list(germline = germline, soma = soma_set$`Gene WB ID`,
germline_intrinsic = germline_intrinsic,
germline_oogenesis = germline_oogenesis,
germline_sperm = germline_sperm)
save(gsubset, file = paste0(data_folder, "sc3_gsubset.RData"), compress = "xz")
# cleanup
file.remove(germline_file)
file.remove(soma_file)
rm(germline_url, germline_file, germline_set,
soma_url, soma_file, soma_set)
Francesconi and Lehner (2014) sample timings (francesconi_time):
# Copied from supp data of Francesconi & Lehner (2014)
francesconi_time <- data.frame(
time = c(
4.862660944, 4.957081545, 5.051502146, 5.145922747, 5.240343348, 5.334763948,
5.429184549, 5.52360515, 5.618025751, 5.712446352, 5.806866953, 5.901287554,
5.995708155, 6.090128755, 6.184549356, 6.278969957, 6.373390558, 6.467811159,
6.56223176, 6.656652361, 6.751072961, 6.845493562, 6.939914163, 7.034334764,
7.128755365, 7.223175966, 7.317596567, 7.412017167, 7.506437768, 7.600858369,
7.69527897, 7.789699571, 7.884120172, 7.978540773, 8.072961373, 8.167381974,
8.261802575, 8.356223176, 8.450643777, 8.545064378, 8.639484979, 8.733905579,
8.82832618, 8.82832618, 8.82832618, 8.875536481, 8.875536481, 8.875536481,
8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481,
8.969957082, 9.017167382, 9.017167382, 9.064377682, 9.064377682, 9.111587983,
9.206008584, 9.206008584, 9.206008584, 9.300429185, 9.394849785, 9.489270386,
9.489270386, 9.489270386, 9.489270386, 9.489270386, 9.583690987, 9.583690987,
9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987,
9.630901288, 9.725321888, 9.819742489, 9.819742489, 9.819742489, 9.819742489,
9.819742489, 9.819742489, 9.819742489, 9.91416309, 10.00858369, 10.05579399,
10.05579399, 10.05579399, 10.05579399, 10.10300429, 10.19742489, 10.19742489,
10.29184549, 10.29184549, 10.29184549, 10.38626609, 10.38626609, 10.38626609,
10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639,
10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639,
10.527897, 10.6223176, 10.6223176, 10.6223176, 10.6223176, 10.6223176,
10.6223176, 10.6223176, 10.6695279, 10.6695279, 10.6695279, 10.7639485,
10.7639485, 10.7639485, 10.8583691, 10.8583691, 10.9527897, 10.9527897,
10.9527897, 11.0472103, 11.1416309, 11.2360515, 11.2360515, 11.3304721,
11.3304721, 11.3776824, 11.472103, 11.56652361, 11.66094421, 11.75536481,
11.84978541, 11.94420601, 12.03862661, 12.13304721, 12.22746781, 12.32188841,
12.41630901, 12.51072961, 12.60515021, 12.69957082, 12.79399142, 12.88841202,
12.98283262, 13.07725322, 13.17167382, 13.26609442, 13.36051502, 13.45493562,
13.54935622, 13.54935622, 13.54935622, 13.54935622, 13.54935622, 13.59656652,
13.69098712, 13.78540773, 13.78540773, 13.78540773, 13.87982833, 13.97424893,
14.06866953, 14.06866953, 14.06866953, 14.16309013, 14.25751073, 14.35193133,
14.44635193, 14.54077253, 14.63519313, 14.72961373, 14.82403433, 14.82403433,
14.82403433, 14.91845494, 14.96566524, 15.01287554, 15.10729614, 15.20171674,
15.29613734, 15.39055794, 15.48497854, 15.57939914, 15.67381974, 15.76824034,
15.86266094, 15.95708155, 16.05150215, 16.14592275, 16.24034335, 16.33476395,
16.42918455, 16.52360515),
geo_accession = c(
"GSM588291", "GSM588174", "GSM588110", "GSM588097", "GSM588271", "GSM588203",
"GSM588105", "GSM588200", "GSM588123", "GSM588122", "GSM588115", "GSM588100",
"GSM588171", "GSM588190", "GSM588229", "GSM588206", "GSM588277", "GSM588129",
"GSM588175", "GSM588151", "GSM588273", "GSM588216", "GSM588099", "GSM588117",
"GSM588179", "GSM588164", "GSM588184", "GSM588092", "GSM588285", "GSM588272",
"GSM588228", "GSM588121", "GSM588170", "GSM588194", "GSM588143", "GSM588149",
"GSM588156", "GSM588220", "GSM588212", "GSM588089", "GSM588209", "GSM588253",
"GSM588091", "GSM588113", "GSM588130", "GSM588202", "GSM588191", "GSM588244",
"GSM588227", "GSM588197", "GSM588233", "GSM588292", "GSM588163", "GSM588196",
"GSM588224", "GSM588283", "GSM588267", "GSM588257", "GSM588221", "GSM588274",
"GSM588090", "GSM588114", "GSM588195", "GSM588265", "GSM588182", "GSM588093",
"GSM588157", "GSM588251", "GSM588177", "GSM588188", "GSM588269", "GSM588145",
"GSM588205", "GSM588162", "GSM588210", "GSM588166", "GSM588125", "GSM588252",
"GSM588207", "GSM588173", "GSM588102", "GSM588286", "GSM588107", "GSM588238",
"GSM588189", "GSM588106", "GSM588295", "GSM588192", "GSM588134", "GSM588183",
"GSM588103", "GSM588198", "GSM588293", "GSM588218", "GSM588259", "GSM588234",
"GSM588137", "GSM588152", "GSM588133", "GSM588250", "GSM588168", "GSM588235",
"GSM588148", "GSM588279", "GSM588140", "GSM588241", "GSM588111", "GSM588231",
"GSM588128", "GSM588131", "GSM588101", "GSM588088", "GSM588281", "GSM588159",
"GSM588249", "GSM588290", "GSM588118", "GSM588154", "GSM588136", "GSM588268",
"GSM588204", "GSM588160", "GSM588135", "GSM588098", "GSM588294", "GSM588225",
"GSM588181", "GSM588248", "GSM588096", "GSM588217", "GSM588147", "GSM588176",
"GSM588116", "GSM588146", "GSM588127", "GSM588104", "GSM588108", "GSM588262",
"GSM588223", "GSM588161", "GSM588237", "GSM588172", "GSM588284", "GSM588256",
"GSM588165", "GSM588211", "GSM588242", "GSM588169", "GSM588240", "GSM588264",
"GSM588219", "GSM588287", "GSM588124", "GSM588178", "GSM588167", "GSM588258",
"GSM588232", "GSM588141", "GSM588112", "GSM588208", "GSM588215", "GSM588132",
"GSM588278", "GSM588275", "GSM588155", "GSM588153", "GSM588109", "GSM588185",
"GSM588138", "GSM588094", "GSM588226", "GSM588236", "GSM588266", "GSM588119",
"GSM588222", "GSM588246", "GSM588150", "GSM588261", "GSM588201", "GSM588247",
"GSM588186", "GSM588280", "GSM588255", "GSM588288", "GSM588260", "GSM588139",
"GSM588245", "GSM588270", "GSM588276", "GSM588199", "GSM588254", "GSM588120",
"GSM588144", "GSM588243", "GSM588214", "GSM588180", "GSM588126", "GSM588282",
"GSM588187", "GSM588158", "GSM588193", "GSM588213", "GSM588230", "GSM588263",
"GSM588142", "GSM588095"),
stringsAsFactors = F)
rownames(francesconi_time) <- francesconi_time$geo_accession
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=C LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] splines parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] vioplot_0.4.0 zoo_1.8-11 sm_2.2-5.7.1 ica_1.0-3
#> [5] drosoRef_0.2.0 limma_3.50.3 wormRef_0.5.0 RAPToR_1.2.0
#> [9] ggpubr_0.6.0 ggplot2_3.4.0.9000 BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 lattice_0.20-45 tidyr_1.3.0
#> [4] digest_0.6.31 utf8_1.2.3 R6_2.5.1
#> [7] backports_1.4.1 evaluate_0.20 highr_0.10
#> [10] pillar_1.8.1 Rdpack_2.4 rlang_1.0.6
#> [13] rstudioapi_0.14 data.table_1.14.6 car_3.1-1
#> [16] jquerylib_0.1.4 magick_2.7.3 Matrix_1.5-3
#> [19] rmarkdown_2.20.1 labeling_0.4.2 stringr_1.5.0
#> [22] tinytex_0.44 munsell_0.5.0 broom_1.0.3
#> [25] compiler_4.1.2 xfun_0.37 pkgconfig_2.0.3
#> [28] mgcv_1.8-41 htmltools_0.5.4 tidyselect_1.2.0
#> [31] tibble_3.1.8 bookdown_0.32 codetools_0.2-18
#> [34] fansi_1.0.4 dplyr_1.1.0 withr_2.5.0
#> [37] rbibutils_2.2.13 grid_4.1.2 nlme_3.1-161
#> [40] jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3
#> [43] magrittr_2.0.3 scales_1.2.1 cli_3.6.0
#> [46] stringi_1.7.12 cachem_1.0.6 carData_3.0-5
#> [49] farver_2.1.1 ggsignif_0.6.4 pryr_0.1.6
#> [52] bslib_0.4.2 generics_0.1.3 vctrs_0.5.2
#> [55] tools_4.1.2 glue_1.6.2 beeswarm_0.4.0
#> [58] purrr_1.0.1 abind_1.4-5 fastmap_1.1.0
#> [61] yaml_2.3.7 colorspace_2.1-0 BiocManager_1.30.19
#> [64] rstatix_0.7.2 knitr_1.42 sass_0.4.5